Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add test for selecting a single complex field array and its parent struct array [databricks] #8744

Merged
merged 13 commits into from
Aug 8, 2023

Conversation

razajafri
Copy link
Collaborator

@razajafri razajafri commented Jul 18, 2023

PR tries to mimic the test in Spark's SchemaPruningSuite.

We create a table with complex fields and try to read only a subfield to see if column pruning is working the way it's supposed to i.e. we are not reading unnecessary columns. e.g. if we have a complex type contact with an array of friends, like so

Contact {
   first_name: String
   middle_name: String
   last_name: String
   friends: Array[Contact]
}

selecting spark.table("contacts").select(explode("friends").alias("friend").select("friend.first_name") shouldn't also read the middle_name and last_name of the friends field.

fixes #8712
fixes #8713
fixes #8714
fixes #8715

@razajafri razajafri self-assigned this Jul 18, 2023
@mythrocks
Copy link
Collaborator

It would be good for description in this PR (if not also the issue #8712) to be more descriptive of the change.

The test here is a sampling of tests covered in SchemaPruningSuite. The objective is to cover some scenarios where column pruning might be tripped up. SchemaPruningSuite does this by selecting different combinations of column (and sub-column) projections.

You should consider combining these tests into a single change. All that's changing between these tests is the output projection.

@razajafri razajafri marked this pull request as draft July 18, 2023 20:03
@mythrocks
Copy link
Collaborator

mythrocks commented Jul 18, 2023

Adding to this for clarity:

  1. There are 4 scenarios listed in [TEST] Compatibility tests for data formats #8666, sampled from SchemaPruningSuite. I don't think it's necessary to exhaustively run all the cases in SchemaPruningSuite. We just need a few (like these 4) to test that the pruning is working.
  2. Some effort will be needed to do the equivalent of checkScan and checkScanSchemata. We'll just need to check that the file-scan operator in question has the expected read schema, just like in the test.

@razajafri
Copy link
Collaborator Author

build

@razajafri razajafri marked this pull request as ready for review July 22, 2023 00:52
@razajafri
Copy link
Collaborator Author

build

@razajafri
Copy link
Collaborator Author

connection failure in premerge

@sameerz sameerz added the test Only impacts tests label Jul 26, 2023
@razajafri
Copy link
Collaborator Author

build

@razajafri
Copy link
Collaborator Author

premerge failing with the following Exception which seems to be unrelated to this change. I also see another PR failing with the same exception

[2023-07-26T21:17:28.616Z] [1m[31m../../src/main/python/parquet_testing_test.py[0m:116: in <module>
[2023-07-26T21:17:28.616Z]     @pytest.mark.parametrize("path", gen_testing_params_for_valid_files())
[2023-07-26T21:17:28.616Z] [1m[31m../../src/main/python/parquet_testing_test.py[0m:105: in gen_testing_params_for_valid_files
[2023-07-26T21:17:28.616Z]     for f in locate_parquet_testing_files():
[2023-07-26T21:17:28.616Z] [1m[31m../../src/main/python/parquet_testing_test.py[0m:91: in locate_parquet_testing_files
[2023-07-26T21:17:28.616Z]     raise AssertionError("Cannot find parquet-testing data in any of: " + locations)
[2023-07-26T21:17:28.616Z] [1m[31mE   AssertionError: Cannot find parquet-testing data in any of: /home/jenkins/agent/workspace/jenkins-rapids_premerge-github-7623-ci-2/integration_tests/src/test/resources/parquet-testing/data/*.parquet, /home/jenkins/agent/workspace/jenkins-rapids_premerge-github-7623-ci-2/integration_tests/src/test/resources/parquet-testing/bad_data/*.parquet, /home/jenkins/agent/workspace/jenkins-rapids_premerge-github-7623-ci-2/thirdparty/parquet-testing/data/*.parquet, /home/jenkins/agent/workspace/jenkins-

@razajafri
Copy link
Collaborator Author

build

1 similar comment
@pxLi
Copy link
Collaborator

pxLi commented Jul 27, 2023

build

@revans2
Copy link
Collaborator

revans2 commented Jul 27, 2023

@razajafri try to upmerge and then run the premerge tests again.

@razajafri
Copy link
Collaborator Author

build

@razajafri
Copy link
Collaborator Author

I am looking into the CI failure. From the logs, the test that I added is possibly crashing the JVM which is causing
subsequent test failures but everything passes on my local workstation.

It could just be a coincidence that the connection dropped so I will try to kick this off one more time.

@razajafri
Copy link
Collaborator Author

build

@razajafri
Copy link
Collaborator Author

Self-bootstrap failure during launch. Please try again later and contact Databricks if the problem persists. Node daemon fast failed and did not answer ping

@razajafri
Copy link
Collaborator Author

build

2 similar comments
@pxLi
Copy link
Collaborator

pxLi commented Aug 2, 2023

build

@pxLi
Copy link
Collaborator

pxLi commented Aug 2, 2023

build

@razajafri
Copy link
Collaborator Author

build

@razajafri
Copy link
Collaborator Author

build

@razajafri
Copy link
Collaborator Author

@mythrocks @revans2 can you please take another look

@razajafri razajafri merged commit 1fb5fc4 into NVIDIA:branch-23.08 Aug 8, 2023
@razajafri razajafri deleted the SP-8712-schema-pruning-test branch August 8, 2023 16:49
jlowe added a commit to jlowe/spark-rapids that referenced this pull request Aug 11, 2023
…arent struct array [databricks] (NVIDIA#8744)"

This reverts commit 1fb5fc4.

Signed-off-by: Jason Lowe <[email protected]>
razajafri pushed a commit that referenced this pull request Aug 11, 2023
* Revert "Create an anonymous subclass of AdaptiveSparkPlanHelper in ExecutionPlanCaptureCallback.scala [databricks] (#8977)"

This reverts commit bcc929c.

Signed-off-by: Jason Lowe <[email protected]>

* Revert "Add test for selecting a single complex field array and its parent struct array [databricks] (#8744)"

This reverts commit 1fb5fc4.

Signed-off-by: Jason Lowe <[email protected]>

---------

Signed-off-by: Jason Lowe <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment